add(fluentd,fluent-bit): added lifecycle object #185
Conversation
Branch updated from 285bb42 to d52a623. Signed-off-by: Sebastian Struß <struss@justdice.io>
@applike-ss do you have a particular use case for this?
Yes, as I stated: we want to give the load balancer enough time to deregister the target rather than letting fluentd exit right away. Since new tasks take some time to become healthy, we would lose logs for inputs that are not retried.
@stevehipwell mind merging it? Or are there open questions?
@applike-ss I am still struggling to see why you need a lifecycle hook instead of a customisable termination grace period. Is there something non-standard that Fluent Bit does with SIGTERM?
The termination grace period (if I'm recalling it correctly) is for letting Kubernetes know how long fluentd/fluent-bit are allowed to take to shut down. However, our problem is NOT that it shuts down slowly; the opposite is the case.
@applike-ss I'm struggling to see why this doesn't already work when the correct K8s configuration is provided (update strategy, PDB, termination grace period)? See Kubernetes best practices: terminating with grace.
Ahh, now I see where the confusion is coming from. When I was talking about the load balancer, I wasn't thinking of the internal load balancing of k8s, but rather the load balancer of our cloud provider. With the current setup, requests are still routed to a pod that is no longer alive when a deployment happens. That is because fluentd's shutdown is quite fast, while the load balancer only detects it after 2 failed health checks, which run at a ~10 second interval.
@applike-ss technically this should still work, but I'm aware of a number of cases where there can be problems. I assume you've been seeing issues with this already? Which cloud and LB implementation are you using?
Yes, we have seen issues there. We are using an AWS NLB. It requires a minimum of 2 healthy/unhealthy health check results before changing a target's status, and the timeout needs to be at least 10 seconds there, so that would potentially mean ~20 seconds of no logs being ingested into the pipeline for clients that do not retry.
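For illustration only: a preStop hook along these lines would keep the container alive through those two failed health checks before SIGTERM is delivered. The durations, image tag, and container name below are assumptions, not values taken from this PR.

```yaml
# Sketch of a rendered container spec; all concrete values here are assumed.
spec:
  terminationGracePeriodSeconds: 40   # must exceed the preStop sleep plus normal shutdown time
  containers:
    - name: fluent-bit
      image: fluent/fluent-bit:2.2.0  # example tag
      lifecycle:
        preStop:
          exec:
            # ~25 s covers two failed 10 s health checks plus a small margin;
            # the image must ship a sleep binary for this exec hook to work.
            command: ["sleep", "25"]
```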
@applike-ss thanks for the details, this makes sense (kubernetes-sigs/aws-load-balancer-controller#2366 has some more details). If you're still using the legacy in-tree controller or instance mode you might want to look at the new AWS Load Balancer Controller and IP mode, as this removes a lot of the legacy issues.
I've got a minor suggested change and then I'd be happy to merge.
In fact, I recently switched to using IP mode. Prior to that (months ago) I read an article from an AWS engineer who recommended adding a preStop lifecycle hook, but that is so long ago that my browser history can't find it any more. I implemented your suggested change in both charts. Note, though, that the other properties were handled the same way I did it previously (welcome to copy-paste hell).
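The actual diff isn't reproduced here, but the chart-side wiring presumably follows the pattern already used for other optional container properties, roughly as follows (key name and indentation are assumptions):

```yaml
# templates/daemonset.yaml (sketch, not the exact change from this PR):
# render whatever the user puts under .Values.lifecycle into the container spec.
{{- with .Values.lifecycle }}
          lifecycle:
            {{- toYaml . | nindent 12 }}
{{- end }}
```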
/lgtm
This PR adds the possibility of attaching a lifecycle object to the pods. That can be helpful for keeping a pod running long enough that it can properly be deregistered from a load balancer.
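Assuming the new values key is simply named lifecycle (as the PR title suggests), enabling the hook from a user-supplied values file might look like this; the sleep duration is an example, not a chart default.

```yaml
# my-values.yaml (hypothetical user override)
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]
```

This would then be applied with something like `helm upgrade fluent-bit fluent/fluent-bit -f my-values.yaml` (release name and repo alias are assumptions).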